Systematic Biology — Latest Matching Preprints

1

An Extended Clade Framework for Annotated Trees in the Context of Phylogeography and Transmission Tree Inference

Berling, L.; Colijn, C.

2026-04-27 bioinformatics 10.64898/2026.04.23.720428 medRxiv

Top 0.1%

78.6%

Bayesian phylogenetic inference produces large samples from a posterior distribution over phylogenetic trees that represents uncertainty in both tree topology and associated variables. Such a collection of trees is hard to interpret and it is common practice to summarize such samples into a single representative tree. Methods for constructing representative trees have largely been restricted to plain tree topologies, encoding only relationships among taxa. Inference with more sophisticated models produces annotated tree objects. These have additional information representing node locations in the case of phylogeography, host information when inferring transmission trees, or sampled ancestor status when incorporating fossil information. Nevertheless, these annotated representations are reduced to a single representative tree, typically using methods developed for plain tree topologies and without accounting for the resulting methodological mismatch. Here, we introduce the concept of an extended clade and investigate an extension of the conditional clade distribution (CCD) model. Through motivating examples and case studies in discrete trait phylogeography and transmission tree reconstruction, we demonstrate limitations of standard summary tree approaches and show how these can be addressed using an extended CCD framework that explicitly incorporates the annotated tree structure.
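
For readers unfamiliar with conditional clade distributions, the following is a minimal sketch of the plain (non-extended) idea on unannotated topologies. It is not the authors' extended CCD. Trees are assumed to be rooted binary trees written as nested tuples of taxon names, and the function names (`clade_splits`, `fit_ccd`, `ccd_probability`) are hypothetical. A tree's probability is taken as the product, over its internal nodes, of how often each clade was split that way in the sample.

```python
from collections import Counter

def clade_splits(tree, out):
    """Return the clade (frozenset of taxa) under `tree`; append each
    internal node's (parent clade, {child clades}) pair to `out`."""
    if isinstance(tree, str):          # a leaf is a taxon name
        return frozenset([tree])
    left = clade_splits(tree[0], out)
    right = clade_splits(tree[1], out)
    parent = left | right
    out.append((parent, frozenset([left, right])))
    return parent

def fit_ccd(trees):
    """Count clades and clade splits across a sample of trees."""
    clade_counts, split_counts = Counter(), Counter()
    for tree in trees:
        splits = []
        clade_splits(tree, splits)
        for parent, split in splits:
            clade_counts[parent] += 1
            split_counts[(parent, split)] += 1
    return clade_counts, split_counts

def ccd_probability(tree, clade_counts, split_counts):
    """Product of conditional split frequencies along the tree."""
    splits = []
    clade_splits(tree, splits)
    prob = 1.0
    for parent, split in splits:
        if clade_counts[parent] == 0:  # clade never seen in the sample
            return 0.0
        prob *= split_counts[(parent, split)] / clade_counts[parent]
    return prob

# Toy posterior sample: three copies of one topology, one of another.
balanced = (("A", "B"), ("C", "D"))
ladder = ((("A", "B"), "C"), "D")
sample = [balanced, balanced, balanced, ladder]
clade_counts, split_counts = fit_ccd(sample)
```

In this toy sample every clade below the root is always split the same way, so each tree's probability reduces to the frequency of its root split. The extended framework described in the abstract additionally tracks annotations such as locations or hosts, which this sketch ignores.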

2

Phylogenetic inference from an incomplete fossil record

Hohmann, N.; Warnock, R. C. M.; Jarochowska, E.

2026-06-28 paleontology 10.64898/2026.06.24.734220 medRxiv

Top 0.1%

66.8%

Fossil data is crucial to construct phylogenetic time trees, which serve as the basis to test a wide range of evolutionary hypotheses. While the fossil record is known to be incomplete, modern stratigraphy provides predictions of the structure of the fossil record as expressed by gap location and duration. Advances in phylogenetic model development allow us to propagate this information into Bayesian phylogenetic inference in the form of priors on time-variable fossil sampling. However, the impact and role of stratigraphic architectures on time tree inference has so far remained unexplored. We introduce a novel simulation framework that combines realistic stratigraphic forward models with phylogenetic simulations. Using this framework, we examine (1) how stratigraphically plausible model violations of fossil sampling due to gaps affect total-evidence inference under the fossilized birth-death model and (2) if stratigraphic knowledge on gap duration and timing improves inference when incorporated in priors on fossil sampling. We find that total-evidence analysis is robust to stratigraphically plausible distribution of gaps in disparate stratigraphic architectures, with results being instead dominated by the number of morphological characters. Surprisingly, incorporating information on prominent gaps in the stratigraphic record does not improve phylogenetic inference. Our results suggest that phylogenetic inference is robust to model violations introduced by stratigraphic gaps over short timescales, with results being dominated by a priori known data availability constraints such as morphological character matrix size. This research establishes the foundations for joint modeling of phylogenetic and stratigraphic processes and narrows the knowledge gap between paleontology, stratigraphy, and neontology.

3

New approaches to detecting and characterizing introgression in large species trees

Mishra, S.; Pomar-Pallares, L.; Lanfear, R.; Hahn, M. W.

2026-06-01 evolutionary biology 10.64898/2026.05.30.728990 medRxiv

Top 0.1%

65.7%

Many current phylogeny-based methods to detect introgression use samples of species-quartets to detect asymmetries in gene tree frequencies. While this has proven to be an accurate and robust approach, applying it to larger species trees often means having to test dozens to hundreds of quartets across a tree. Furthermore, any single introgression event can have effects on multiple quartets (with no principled way to determine the number of unique events from a set of quartets), and the direction of introgression cannot always be determined from quartet comparisons alone. Here, we present a new approach to detecting introgression using the frequency with which more distantly related clades are attached to one another among a set of gene trees. Testing for introgression between pairs of branches is straightforward using these discordant attachment frequencies. We further show that the direction of introgression can be inferred between any pair of branches separated by at least two internal branches of the species tree, and that theoretical expectations of gene tree frequencies under introgression can be used to accurately determine the number of independent times genes have been exchanged. Application of these methods to data from cichlids and Drosophila demonstrates the power of the new approaches. The DAFT software package is available from: https://github.com/smishra677/DAFT/

4

EMPIRE: The Ellipse Model for Phylogenetic Inference of Range Evolution

Swiston, S. K.; McHugh, S. W.; Landis, M. J.

2026-04-24 evolutionary biology 10.64898/2026.04.23.720387 medRxiv

Top 0.1%

61.3%

Many phylogenetic models of historical biogeography exist for describing how lineages move and evolve over time. Here, we present the Ellipse Model for Phylogenetic Inference of Range Evolution (EMPIRE), which models the movement and splitting of species range ellipses in continuous space, summarizing important attributes of each range, such as its position, size, and orientation. The framework allows us to reconstruct ancestral range ellipses, investigate rates governing important processes like movement, expansion, and elongation, and examine the spatial context of speciation, including asymmetric range inheritance at cladogenesis. We apply EMPIRE to the Australian Sphenomorphinae, a group of skinks whose diversification has coincided with substantial climatic change over the past ~36 million years. We find that speciation events are positively associated with aridification, while daughter lineages post-speciation do not tend to show evidence of ecological partitioning.

5

Phylogenetically Dispersed Subsetting for Species-Level Machine Learning Evaluation: Dependence-Aware Validation and Limited Effective Information

Huang, R.; Qi, B.; Niu, D.-K.

2026-05-26 evolutionary biology 10.64898/2026.05.22.727088 medRxiv

Top 0.1%

59.1%

Machine learning is increasingly applied to species-level biological data, but phylogenetic autocorrelation can make evaluation species statistically non-independent, violating the assumption of independence in model evaluation and potentially leading to overconfident performance claims through phylogenetic interpolation. We present a dependence-aware framework, implemented in the R package PhyloSubset, for constructing phylogenetically dispersed species subsets from a user-defined candidate pool. The framework treats subset construction as an optimization problem based on distance-based criteria that capture closest-pair separation, overall phylogenetic spread, and nearest-neighbor spacing. By changing the optimization objective, the same framework can also construct phylogenetically clustered subsets as high-dependence contrast cases. Selected subsets are evaluated against empirical null distributions generated by repeated random sampling and are further assessed using diagnostics derived from the within-subset correlation structure, including mean-based effective sample size (MeanESS). Using Carnivora and Cricetidae as empirical case studies, we show that dispersed and clustered subsets occupy opposite tails of random-subset distributions for both distance-based metrics and covariance-based dependence diagnostics. However, phylogenetically dispersed subsetting reduced but did not eliminate internal dependence: in the Carnivora example, a nominal 20-species dispersed subset had a MeanESS of only 4.66 under a Brownian-motion covariance structure, and MeanESS exceeded half of the nominal subset size only when the assumed phylogenetic covariance was substantially weakened. These results show that phylogenetically dispersed subsetting can provide stricter and more reproducible evaluation subsets, while also revealing how little effective information may remain in species-level benchmarks. 
More broadly, PhyloSubset provides a practical foundation for dependence-aware validation strategies in species-level machine learning.

6

Substitution rate variation, not hidden paralogy, drives false hybridization signal in phylogenetic network inference

Li, B.; Ane, C.

2026-05-18 evolutionary biology 10.64898/2026.05.11.723986 medRxiv

Top 0.1%

55.9%

Phylogenetic network inference methods are increasingly used to detect hybridization and gene flow from genomic data, but their robustness to common sources of model violation remains poorly characterized. We conducted a simulation study to evaluate the effects of hidden paralogy and substitution rate variation on two widely used network inference methods: find_graphs from ADMIXTOOLS 2 and SNaQ. Using an eight-taxon species tree calibrated from an empirical reptile phylogeny, we simulated data under various levels of hidden paralogy (from none to strong) and three levels of rate variation (none, gene-specific, and lineage-specific). We found that hidden paralogy had limited impact on network inference under the conditions examined: both network methods correctly favored a tree without reticulation, and ASTRAL recovered the correct species tree every time. In contrast, lineage-specific rates severely biased find_graphs, inflating worst f-statistic residuals well beyond the standard acceptance threshold. SNaQ correctly selected a tree model almost always across all conditions, though its network with h = 1 reticulation displayed the true species tree with a lower probability under lineage-specific rates. We also show that the standard worst residuals threshold of 3 for find_graphs produces inflated type I error even without rate variation, and we recommend empirical calibration of this threshold within each study system.

7

Summarizing Evolutionary Trajectories from Phylogenetic Character Maps of Discrete Traits

McHugh, S. W.; Landis, M. J.

2026-06-16 evolutionary biology 10.64898/2026.06.14.732171 medRxiv

Top 0.1%

53.0%

When reconstructing phylogenetic character histories, biologists aim to identify distinct evolutionary trajectories, or paths of character state evolution. However, biologists typically wish to summarize the information representing large numbers of potential character histories for a single phylogeny. For discrete characters, few approaches exist for summarizing the number of unique evolutionary trajectories beyond the frequency of specific events (i.e., state transition types) or the time lineages spend in each state. Here, we introduce a framework for summarizing the evolutionary trajectories of discrete character histories by compressing them into trajectory trees, where branches represent unique character-evolution pathways rather than lineages. This framework includes a novel compressed tree representation, called a scenario tree, that retains temporal information, ensuring that each root-to-tip path represents a unique, temporally explicit evolutionary trajectory. We describe and apply several approaches to summarize phylogenetic trees into transition trees. We include visual summaries - such as consensus trajectory trees, trajectory tree tanglegrams, and "trajectory-through-time plots" - to compare how unique evolutionary trajectories accumulate across lineages and state transitions. We also include quantitative summaries, such as the time spent in unique evolutionary trajectories and the number of transitions that follow unique character-state transitions. We use our new trajectory-wise summaries to evaluate the adequacy of commonly used continuous-time Markov models of character evolution, which are memoryless and consider only the rates between pairs of states. We conducted multiple simulation-based experiments demonstrating the utility of our novel trajectory-wise approaches. 
We also apply our new trajectory-wise approaches to Greater Antillean Anolis lizard biogeography and ecomorph evolution, and find that Anoles evolved along considerably more unique evolutionary trajectories than expected under simulations of our best-fitting character evolution model. The number of unique evolutionary paths accumulated in an "early burst" pattern relative to simulated trajectories, with this burst being more intense than expected across all character state transition events.

8

A hierarchical Bayesian framework accommodates intraspecific and interspecific variation in multivariate traits

Raskin, L. Y.; Seselj, M.; Huelsenbeck, J.; Lim, W.; Li, J. K.; Guatelli-Steinberg, D.; O'Hara, M. C.; Bitarello, B. D.

2026-05-31 evolutionary biology 10.64898/2026.05.28.728593 medRxiv

Top 0.1%

49.2%

Phylogenetic comparative methods are a critical tool in biology, providing the framework to test evolutionary hypotheses of phenotypic diversification. Accommodating intraspecific variation in these analyses is critical for accurate evolutionary inference, but current multivariate methods either assume traits evolve independently or that all taxa share the same intraspecific covariance structure. Violations of these assumptions can produce biased estimates of evolutionary parameters. Here, we introduce a hierarchical Bayesian framework for multivariate traits that jointly estimates taxon-specific intraspecific covariance structures alongside the underlying evolutionary process. This framework propagates uncertainty from sample size discrepancies and missing data, enabling the incorporation of highly variable morphological traits into phylogenetic analyses. Analyses of simulated data demonstrate that our framework achieves well-calibrated coverage (95%), whereas the standard practice of treating taxon means as known without error reduces coverage to 70%, confirming that ignoring intraspecific variation produces systematically biased evolutionary inference. We apply this framework to estimate the evolutionary rates and intraspecific distributions underlying perikymata distribution diversity in great apes, including modern humans and Neandertals. We show that, compared to other great apes, canine perikymata spacing in the genus Homo likely evolved under a substantially different regime than non-human apes, with cervical enamel evolving both rapidly and in a coordinated, modular fashion. We further find that Gorilla and Pongo show striking conservation in perikymata spacing relative to late Homo and Pan. 
These results and our method, which is applicable to other multivariate traits, provide the first phylogenetically rigorous characterization of an enamel growth trait across the great ape clade and establish intraspecific uncertainty propagation as a necessary component of multivariate phylogenetic analysis.

9

Is level-1 blob reconstruction under the network multispecies coalescent easy?

Dai, J.; Molloy, E.

2026-06-10 bioinformatics 10.64898/2026.06.06.730607 medRxiv

Top 0.1%

46.5%

Hybridization is an important evolutionary process, commonly modeled by the network multispecies coalescent. Reconstructing evolutionary histories under this model is notoriously costly, even for level-1 networks where hybridization events are isolated from each other. The widely used methods that combine speed with statistical guarantees rely on quartet concordance factors computed for all subsets of four species, resulting in an O(n^4 k) bottleneck that severely limits scalability to large numbers of species (n) and genes (k). Among quartet-based methods, NANUQ+ is notable because it decomposes the problem into two steps: first reconstructing a tree of blobs, which compresses each non-treelike part of the network, called a blob, into a single vertex, and second reconstructing the internal structure of each level-1 blob, specifically its circular order and hybrid vertex. Here, we investigate whether level-1 blob reconstruction is difficult once the tree of blobs is known. We present a fast and statistically consistent algorithm, called NetCS, based on two simple primitives: majority voting and merge sort, circumventing the bottleneck of computing all quartet concordance factors. In simulations, NetCS achieved comparable accuracy to NANUQ+ and was dramatically faster, enabling analyses of 200 taxa and 1000 genes in only a few minutes. Both methods attained near-perfect accuracy when given the true tree of blobs; however, their performance degraded in end-to-end pipelines due to errors in tree of blobs reconstruction. Strikingly, even methods that reconstruct level-1 networks directly struggled to accurately predict hybrid ancestry. Our results suggest that reconstructing level-1 blobs is unexpectedly easy once the tree of blobs is known, and that a major challenge for phylogenetic network inference lies in accurate tree of blobs reconstruction.
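
To give a sense of the scale behind the all-quartets bottleneck, the short sketch below counts the four-taxon subsets for a given number of species and multiplies by the number of genes. The function names are hypothetical, and the product is only a rough proxy for work (one quartet lookup per gene tree), not a timing model of NANUQ+ or NetCS.

```python
from math import comb

def quartet_count(n_taxa: int) -> int:
    """Number of distinct four-taxon subsets, C(n, 4)."""
    return comb(n_taxa, 4)

def quartet_gene_pairs(n_taxa: int, n_genes: int) -> int:
    """Rough proxy for work: every quartet inspected in every gene tree."""
    return quartet_count(n_taxa) * n_genes

print(quartet_count(200))             # 64684950
print(quartet_gene_pairs(200, 1000))  # 64684950000
```

At the 200-taxon, 1000-gene scale mentioned in the abstract, that is roughly 65 million quartets and 65 billion quartet-by-gene combinations, which illustrates why avoiding the full enumeration matters.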

10

Statistical inference of the Tree of Blobs of a phylogenetic network from quartet concordance factors

Rhodes, J. A.; Allman, E. S.; Ane, C.; Banos, H.

2026-05-31 evolutionary biology 10.64898/2026.05.28.728501 medRxiv

Top 0.1%

44.8%

A phylogenetic network represents evolutionary relationships involving hybridization, gene flow, or admixture. While the full network may not be identifiable from genomic data under common coalescent models, its tree of blobs, depicting only the tree-like portions of the network structure, is. We introduce ECToBlob (Edge Contraction for Tree of Blobs), a new statistically consistent algorithm to estimate the tree of blobs from quartet concordance factors. Starting from a resolved tree, ECToBlob successively contracts edges which statistical tests indicate do not belong in the tree of blobs, due to reticulate or polytomous signal. We show that ASTRAL provides a valid starting tree under common assumptions, in that, asymptotically in the number of loci, trees optimizing ASTRAL's criterion refine the tree of blobs. We describe several algorithm variants, differing in how evidence from multiple tests is combined to determine if the edge should be contracted, and provide software implementations. Relevance to Life Sciences: Hybridization, gene flow, and admixture are now recognized as important aspects of evolutionary history, but their genomic signal is confounded with that from a coalescent process, creating substantial challenges for inferring phylogenetic networks. The network's tree of blobs identifies areas where reticulation occurred, separated by tree-like branching. ECToBlob quickly estimates the tree of blobs using quartet concordance factors from gene trees, and provides a measure of statistical support for its result. Performance is illustrated through simulation and on empirical data, using an implementation in the R package MSCquartets. While the presence of a blob may be all that can be inferred in some cases, in others ECToBlob offers a robust and principled way to focus further analyses on more local reticulate structure. Mathematical Content: This work makes contributions to mathematical phylogenetics in optimization, combinatorics, and statistics. First, we show that any tree maximizing quartet support (the criterion underlying ASTRAL) is a refinement of the network's tree of blobs under the coalescent model. Second, we give a concise proof that whether a network has a cut-edge corresponding to a given split is determined by information in certain subcollections of its 4-taxon subnetworks (quarnets). Finally, we propose valid statistical approaches for combining p-values across multiple quarnet hypothesis tests, proving that their use with specific decreasing test levels leads to statistically consistent inference as the number of loci grows. MSC codes: 05C90, 60J95, 62-04, 62F07, 92D15

11

Evolutionary transitions to self-fertilization influence the inference of introgression history

Metzger, L.; de Meaux, J.; Rahnamae, N.; Tellier, A.

2026-05-26 evolutionary biology 10.64898/2026.05.21.726369 medRxiv

Top 0.1%

33.4%

The availability of polymorphism data and statistical inference methods allows documenting the widespread occurrence of introgression and hybridization across the tree of life. However, these methods are primarily optimized for outcrossing species without generation overlap or seed banking, thereby ignoring the consequences of life-history traits (and their evolution) on genome-wide polymorphism patterns. We investigate how a transition from outcrossing to selfing, a common feature of plant species, may affect the inference of introgression history. We simulate six demographic models with different histories of gene flow under two mating-system scenarios: a constant high selfing rate and a transition-to-selfing scenario. Using an Approximate Bayesian Computation framework with random forests, we compare model choice based on genotypic summary statistics alone and in combination with coalescent statistics derived from coalescent tree sequences. Including coalescent information substantially improves model classification, especially for distinguishing secondary contact and continuous gene flow. Cross-classification of pseudo-observed datasets shows that ignoring a transition to selfing can lead to false demographic inferences, with transition-to-selfing data often misclassified as ancient gene flow or secondary contact when analyzed under a constant selfing model. We then apply this inference framework to genomic data from Arabis nemorensis and Arabis sagittata, two predominantly selfing species with evidence for post-split hybridization. Our analyses reveal a likely transition to selfing roughly 470,000-890,000 years ago, and a likely continuous level of gene flow after the species split. The latter results lead us to revisit our previous scenario of gene flow due to secondary contact between species inferred under constant selfing. 
Changes in mating systems and, by extension, life-history traits can therefore bias inference about introgression if they are not explicitly modeled. Tree-sequence-based coalescent statistics provide useful information for inferring complex demographic histories that involve both gene flow and transitions to selfing.

12

A probabilistic and phylogenetic principal component analysis for modelling high-dimensional trait evolution

Montoya, P.; Joseph, J.; Goswami, A.; Morlon, H.; Clavel, J.

2026-05-29 evolutionary biology 10.64898/2026.05.27.728209 medRxiv

Top 0.1%

30.9%

Given the ever-increasing availability of highly detailed phenotypes, modelling trait evolution in a multivariate framework is becoming a challenging task. Current phylogenetic comparative methods often struggle with high-dimensional datasets because they suffer from computational limitations and poor interpretability. Here, we propose a maximum likelihood-based approach called Probabilistic and Phylogenetic Principal Components Analysis (P3CA) to circumvent current limitations. This approach is based on a continuous latent variable model, whereby observed traits are explained by a smaller number of unobserved variables that evolve according to a given evolutionary model. We implement the approach under Pagel's lambda model using an Expectation-Maximisation algorithm that makes it computationally efficient and allows missing values. Using simulations, we demonstrate that evolutionary parameters are accurately estimated, regardless of phylogenetic signal, the number of traits or the proportion of missing values. The reconstruction of the reduced space is more accurate than the one obtained using other dimensionality reduction approaches, such as phylogenetic and conventional PCA. Likewise, the estimated values for missing data are more accurate than those obtained using current phylogenetic data imputation approaches. We illustrate the approach on a 3D geometric morphometric dataset describing Crocodyliformes skull shapes and containing around 4% of missing data. Our P3CA method unlocks the possibility to analyse and more easily interpret the large-scale multivariate datasets generated in recent decades within a phylogenetic comparative framework.

13

Phylogenetic tree inference using generative models

Dotan, E.; Schers, A.; Wygoda, E.; Pupko, T.; Belinkov, Y.

2026-06-16 bioinformatics 10.64898/2026.06.14.732140 medRxiv

Top 0.1%

27.9%

Accurate inference of phylogenetic trees is fundamental to evolutionary biology, yet existing methods rely on complex pipelines involving multiple sequence alignment, explicit evolutionary models, and computationally intensive tree search procedures. Here, we present BetaInfer, a generative framework that reformulates phylogenetic tree inference as a sequence transduction problem. BetaInfer leverages hybrid transformer-based architectures to directly map sets of unaligned sequences to phylogenetic trees represented in Newick format. Trained on large-scale simulated evolutionary data with known ground truth, BetaInfer learns to capture complex evolutionary signals directly from sequence data. Ensemble-based generation of multiple candidate trees further improves robustness, reducing reconstruction error by over 30% relative to single predictions. Across extensive evaluations on both simulated and empirical datasets, BetaInfer achieves competitive performance relative to state-of-the-art phylogenetic pipelines, matching, and in some cases exceeding, the accuracy of established likelihood-based and distance-based methods under a wide range of conditions. Interpretability analyses reveal that BetaInfer leverages internal pairwise-distance computations to synthesize evolutionary relationships into an integrated, global representation that supports direct tree generation. Together, these results demonstrate that generative models can serve as a viable and scalable alternative to standard phylogenetic pipelines.

14

Guide-tree bias of whole genome alignment can mislead phylogenomic analyses

Tao, Q.; Grünewald, S.

2026-07-09 evolutionary biology 10.64898/2026.07.06.736671 medRxiv

Top 0.1%

26.4%

Whole-genome alignment (WGA) is widely used for genome-scale phylogenetic inference, and most scalable WGA pipelines rely on progressive alignment guided by a pre-specified tree. Among progressive whole-genome aligners, Progressive Cactus is a successful state-of-the-art method. However, analyses of real and simulated avian data indicate that guide-tree choice can influence downstream tree inference; star guide trees do not remove this effect and can exacerbate long-branch attraction artefacts. We have developed a consensus strategy based on the Progressive Cactus framework by generating a small set of alternative guide-tree alignments and retaining only homology relationships consistently recovered across all alignments. In simulation experiments, consensus alignments improve precision, bring inferred site-pattern frequency distributions closer to those of the true alignments, and recover more true splits than single guide-tree alignments. In a real landbird (Telluraves) dataset, we observe a strong bias towards single binary guide trees and long-branch attraction for less resolved trees. While the reconstructed tree still depends on the phylogenetic method and taxa sampling, our consensus alignment has no clear bias. We implemented a hierarchical consensus workflow that only locally resolves uncertainty in the guide tree. Therefore, the computational cost increases only moderately, for example by an estimated 68 percent for a recently published large-scale alignment of more than 300 modern birds (Neoaves) taxa.

15

The phylogenetic affinities of Chaetognathifera, with considerations of systematic error and the robusticity of macrosyntenic results

Fleming, J. F.; Roberts, N. G.; Herlyn, H. F.; Ahlrichs, W.; Kocot, K.; Struck, T. H.

2026-06-08 evolutionary biology 10.64898/2026.06.08.730799 medRxiv

Top 0.1%

18.3%

Chaetognathifera, a superphylum comprising Syndermata (Rotifera including Acanthocephala), Micrognathozoa, Gnathostomulida and Chaetognatha, is a complex grouping generally recovered as the sister to all other Lophotrochozoa. However, phylogenetic relationships within this group are controversial, in part due to poor sampling, resulting in two key questions. The first is whether Gnathostomulida or Chaetognatha represent the sister group to Syndermata+Micrognathozoa. The second is the phylogenetic position of the former phylum Acanthocephala within Syndermata. Here, we present the first study of the phylogenetic affinities of Chaetognathifera with genomic representation from all major phyla, and explore the potential of macrosynteny to better understand these relationships. For this latter aspect, we also developed a new jackknifing procedure to assess the robustness of linkage groups inferred by macrosyntenic analyses. We show that the phylogenetic relationships between these clades are corroborated through a variety of gene selection and analysis methodologies. This provides clear evidence of Acanthocephala as a derived clade within Syndermata as sister to Seisonidea, and that Gnathostomulida is sister to Syndermata+Micrognathozoa, with Chaetognatha as the earliest diverging clade within Chaetognathifera. On the other hand, we found that macrosyntenic patterns cannot resolve this question. Moreover, almost all possible linkage groups involving chaetognathiferan species lack robusticity and hence, should not be considered reliable. As a consequence so far, in Chaetognathifera none of the bilaterian ancestral linkage groups can be reliably found and independent massive chromosomal rearrangements occurred. We therefore strongly suggest that studies of macrosynteny should not only assess the significance of possible linkage groups, but also the robusticity of these linkage group inferences. 
Furthermore, we also present a script for this purpose, which can be found at: https://github.com/JFFleming/MacrosyntenicJackknife

16

The hypercubic Mk model in reduced state space for the coupled, reversible coevolution of multiple binary characters

Johnston, I.; Diaz-Uriarte, R.; Boyko, J.

2026-06-03 evolutionary biology 10.64898/2026.06.01.729317 medRxiv

Top 0.1%

18.0%

Many scientific questions involve the coevolution of coupled, binary features over time - from phenotypes in evolutionary biology to mutations in cancer development. Evolutionary accumulation models (EvAMs) often neglect reversibility in these systems, uncertainty in observations, and/or phylogenetic connections between observations. By contrast, the Mk model from phylogenetic comparative methods supports reversibility, uncertainty, and relatedness, but compute time scales like O(4^L) in the number of features L, making it challenging to apply to more than about six coupled, coevolving binary characters. Here, we introduce HyperMk2, a method using output from a Fitch-like parsimony algorithm to reduce the state space associated with many coevolving characters while retaining flexibility, reversibility, and phylogenetic information. This approach, while approximate, scales linearly in the number of distinct observations rather than exponentially in the number of characters, supporting the investigation of much larger systems than previously possible. We demonstrate how this method allows the inference of evolutionary dynamics of anti-microbial resistance in bacteria, including the identification of potential influences between characters, and discuss its broader application.
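
The exponential growth mentioned above can be made concrete with a small sketch. One way to read the 4^L figure is that L binary characters give 2^L joint states, so a full transition-rate matrix over those states has 2^L x 2^L = 4^L entries. The helper names below are hypothetical, and the single-flip neighbour function assumes, purely for illustration, that states sit on the vertices of a hypercube and change one character at a time; it is not HyperMk2's implementation.

```python
def mk_sizes(n_chars: int) -> tuple[int, int]:
    """Joint states (2^L) and full rate-matrix entries (4^L) for L binary characters."""
    n_states = 2 ** n_chars
    return n_states, n_states * n_states

def single_flip_neighbors(state: tuple[int, ...]) -> list[tuple[int, ...]]:
    """States reachable by changing exactly one binary character (hypercube edges)."""
    return [state[:i] + (1 - state[i],) + state[i + 1:] for i in range(len(state))]

for n_chars in (3, 6, 10):
    n_states, n_entries = mk_sizes(n_chars)
    print(n_chars, n_states, n_entries)
```

Six characters already give 64 states and 4096 matrix entries, and ten give 1024 states and over a million entries, which is consistent with the abstract's remark that about six coupled characters is a practical limit for the unreduced model.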

17

Resolving the oak tree of life: comparing RADseq and whole genome resequencing methods for oak phylogenetics

Hipp, A. L.; Althaus, K. N.; Fuller, E. L.; Hahn, M.; Larson, D. A.; Mohn, R. A.; Wang, B.; Manos, P. S.

2026-05-17 evolutionary biology 10.64898/2026.05.14.725274 medRxiv

Top 0.1%

17.0%

Forest trees pose numerous potential challenges to phylogenomic inference. Their large effective population sizes and relatively long generation times lead to deep allele coalescence and consequently incomplete lineage sorting (ILS), which biases inferences of divergence times toward older ages and introduces gene tree discordance. Deep phylogenetic divergences, reaching back into the Paleocene, introduce reference-mapping biases. Introgression--the movement of genes between lineages--may result in different phylogenies being inferred depending on which individuals are included in analysis, even if the plurality of the genome favors the divergence history unaffected by introgression. These factors influence phylogenetic inference across the Tree of Life but are particularly prevalent in forest trees. Oaks (Quercus) are notable for all three influences. In addition, our knowledge of the oak phylogeny is currently based strongly on restriction site associated DNA sequencing (RADseq) datasets published over the past decade, which may introduce additional sources of uncertainty. In this chapter, we analyze a 322-species RADseq dataset and genome resequencing data from across the genus to address sources of uncertainty in our understanding of the global oak phylogeny, which we hope will serve as a model for other research groups working on comparable woody plant groups.

18

Beyond infinite sites: Generalized ABBA-BABA statistic for deeper phylogenies

Zhang, C.; Nielsen, R.

2026-07-08 bioinformatics 10.64898/2026.07.06.736715 medRxiv

Top 0.1%

15.0%

Patterson's D statistic detects gene flow from ABBA-BABA site patterns, but its biallelic site patterns fail under deeper divergences where multiple hits cause false positives. We propose two extensions, D+ and D*. Both incorporate multiallelic site patterns to reduce saturation bias under the JC and F84 models. Simulations show that D+ and D* both remain correctly null under all conditions and detect gene flow effectively, with distinct advantages: D+ guarantees non-negativity of the denominator, while D* provides greater robustness when mutation rates vary across genomic regions. The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER.
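
For orientation, the classic biallelic statistic that this abstract extends can be sketched as follows. This is the standard D = (ABBA - BABA) / (ABBA + BABA) computed from four aligned sequences ordered (P1, P2, P3, outgroup); it is not the D+ or D* statistic proposed here, nor code from the ASTER repository, and the function name and toy input are hypothetical.

```python
from collections import Counter

NUCLEOTIDES = set("ACGT")


def patterson_d(p1, p2, p3, outgroup):
    """Classic Patterson's D from four aligned sequences (P1, P2, P3, O).

    Only biallelic sites are counted; the outgroup allele is treated as
    ancestral ("A") and the other allele as derived ("B").
    Returns None when no ABBA or BABA sites are present.
    """
    counts = Counter()
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        alleles = {a, b, c, o}
        # Skip gaps/ambiguity codes and sites that are not biallelic.
        if not alleles <= NUCLEOTIDES or len(alleles) != 2:
            continue
        if a == o and b == c and b != o:
            counts["ABBA"] += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            counts["BABA"] += 1  # P1 and P3 share the derived allele
    total = counts["ABBA"] + counts["BABA"]
    if total == 0:
        return None
    return (counts["ABBA"] - counts["BABA"]) / total


if __name__ == "__main__":
    # Toy alignment: two ABBA sites, one BABA site, one invariant site.
    print(patterson_d("AACA", "CCAA", "CCCA", "AAAA"))
```

Because the sketch discards every site with more than two alleles, it uses only the biallelic patterns that the abstract identifies as unreliable at deeper divergences.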

19

Traits and hidden states: is self-fertilization associated with rates of diversification across mating and sexual systems?

Meyer, E. M.; Rosenberg, M. S.; Boyd, B. M.; Eckert, A. J.

2026-06-20 evolutionary biology 10.64898/2026.06.18.733044 medRxiv

Top 0.2%

15.0%

Reproductive biology is a key determinant of fitness. State-dependent speciation-extinction methods (SSEs) are often used to associate traits with patterns of diversification. Previously, SSEs have been used within families to investigate the hypothesis that selfing is an evolutionary "dead end." To the best of the authors' knowledge, no study has looked across families, which would increase power to more generally test this hypothesis. Here, we examine the impact of 1) mating system and 2) sexual system on diversification across 18 phylogenetically diverse families. We also discuss how more recent advances in SSE models (i.e., "hidden state" and tree-only models) influence our interpretation of these patterns and evaluate how the relationship between mating and sexual systems can be leveraged to gain insight into the impact of reproductive biology on evolutionary outcomes. In this study, we find that the mating system as a trait does not better explain patterns of diversification when compared to null models, but the sexual system often does. We also find family-level heterogeneity in our results, which suggests conclusions drawn from studies on individual families may not be consistent with any broader trend.

20

GTRspmix: Capturing Heterogeneity of Exchangeabilities Across Sites to Improve Protein Phylogenetics

Harada, R.; Susko, E.; Wong, T. K. F.; Banos, H.; Ly-Trong, N.; Lanfear, R.; Theobald, D. L.; Minh, B. Q.; Roger, A. J.

2026-06-18 evolutionary biology 10.64898/2026.06.18.729217 medRxiv

Top 0.2%

15.0%

Site rate and profile mixture models capture the heterogeneity of the amino acid substitution process across sites. However, these models typically use a single matrix of amino acid exchangeabilities and ignore potential heterogeneities of these exchangeabilities across sites. Simply combining multiple exchangeability matrices with rate and profile mixtures leads to a combinatorial explosion of mixture components and a prohibitive increase in free parameters. Here, we introduce GTRspmix, a novel framework that incorporates multiple exchangeability matrices into profile and site rate mixture models while effectively managing model complexity. GTRspmix employs a clustering-based strategy that groups profiles and assigns a distinct exchangeability matrix to each profile cluster. Evaluations using both empirical and simulated datasets demonstrate that GTRspmix fits empirical data significantly better than conventional models, and that overparameterization does not present a problem for sufficiently large alignments. Based on these results, we estimated general-purpose empirical models (SXXpfamCYY series available in IQ-TREE3) from the Pfam database. These general-purpose models not only fit data much better, but they also influence branch length and tree topology estimates, effectively mitigating long-branch attraction artifacts. Because the total number of rate matrices remains manageable, the computational efficiency of the inference is identical to that of conventional profile mixture models (e.g., LG+C60+G4). GTRspmix provides a more realistic and flexible model of protein evolution, offering a robust foundation for the inference of reliable phylogenetic trees.
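
The combinatorial argument above can be illustrated with a small counting sketch. It assumes a C60+G4-style model with 60 profile classes and 4 discrete rate categories, plus a hypothetical set of 4 exchangeability matrices; the matrix count and the function name are illustrative assumptions, not values or code from GTRspmix.

```python
def mixture_components(n_profiles, n_rate_categories,
                       n_exchangeability_matrices, clustered):
    """Count mixture components under two ways of adding matrices.

    clustered=False: every matrix is crossed with every profile and rate
    category (the naive combination described in the abstract).
    clustered=True: each profile is tied to exactly one matrix through its
    cluster, so the component count does not depend on the matrix count.
    """
    if clustered:
        return n_profiles * n_rate_categories
    return n_exchangeability_matrices * n_profiles * n_rate_categories


if __name__ == "__main__":
    # 60 profiles x 4 rate categories, with 4 hypothetical matrices.
    print(mixture_components(60, 4, 4, clustered=False))
    print(mixture_components(60, 4, 4, clustered=True))
```

Under these assumptions the naive crossing gives 960 components, while tying each profile to one matrix keeps the count at 240, the same as a single-matrix model with 60 profiles and 4 rate categories.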